ABSTRACT



This study examines the effects on the National Assessment 
of Educational Progress (NAEP) achievement levels of using item response 
theory (IRT) models that have nominal missing-response parameters. It 
compares cutpoints based on item parameters that were fitted using two 
different models. The first set of cutpoints was based on parameters for the 
two- and three-parameter logistic model, and the second set of cutpoints was 
based on R. Bock's (1972) nominal model. Data are from the 1992 NAEP in 
mathematics and reading for grade 12. For reading, data included 1,966 
responses to a block of items, and for mathematics, data included 2,192 
responses. Other data were the item-by-item ratings by each panelist who 
participated in the Achievement Levels Setting process for NAEP 1992. For 
each subject, the cutpoints set using the probability curves obtained by the 
different models were compared. The percentages of students scoring at each 
level were also compared. For reading, the logistic model, when fitted to the data, 
converged in 25 iterations and yielded a marginal reliability of 0.67 with 
maximum information of 4.3 at theta equals -0.5. The nominal model converged 
in 88 iterations and yielded a marginal reliability of 0.85 with a maximum 
information of 14.2 at theta equals 0. For mathematics, the logistic model 
converged in 184 iterations and yielded a marginal reliability of 0.61 with 
maximum information of 23.9 at theta equals 1.5. The nominal model, with 
mathematics data, converged in 46 iterations, and yielded a marginal 
reliability of 0.62, with maximum information of 7.3 at theta equals -0.2. 
When the percentages of students scoring at or above each cutpoint were 
compared, none scored at the Advanced level using the nominal model. These 
preliminary results suggest the direction of future studies, but cannot be 
generalized to the NAEP assessment program. An appendix contains achievement 
levels descriptions. (Contains three tables, six figures, and five 
references.) (SLD) 
Background of the Study: The National Assessment of Educational Progress 
(NAEP) is one of the large-scale assessment programs that develop tests that 
combine multiple-choice and open-ended items. In the 1992 NAEP in 
Mathematics, for example, there were 99 multiple-choice, 54 short answer, and 
five extended answer items for grade four, 118 multiple-choice, 59 short answer, 
and six extended answer items for grade eight, and 115 multiple-choice, 58 short 
answer, and six extended answer items for grade twelve. Overall, 64% of the 
items were multiple-choice, 33% were short answer, and three percent were 
extended answer. The NAEP items were administered in blocks, each of which is 
a combination of multiple-choice and constructed response items. 
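
The overall percentages quoted above can be checked against the per-grade item counts. The short sketch below only redoes that arithmetic; the dictionary layout and variable names are illustrative and not part of the NAEP documentation.

```python
# Item counts for grades 4, 8, and 12, as quoted in the paragraph above.
counts = {
    "multiple-choice": [99, 118, 115],
    "short answer": [54, 59, 58],
    "extended answer": [5, 6, 6],
}

# Total per item type, pooled across the three grades.
totals = {kind: sum(per_grade) for kind, per_grade in counts.items()}
grand_total = sum(totals.values())

# Share of each item type, rounded to the nearest whole percent.
shares = {kind: round(100 * n / grand_total) for kind, n in totals.items()}

for kind, pct in shares.items():
    print(f"{kind}: {totals[kind]} of {grand_total} items ({pct}%)")
```

Pooling the three grades reproduces the reported split of 64%, 33%, and three percent.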

In 1994, Swinton, in his paper titled Scoring with Nominal Missing-Response 
Parameters, reported that the increasing proportion of open-ended items 
in the NAEP corresponded to a rise in the number of nonresponses to those items. 
He further stated that "this problem is exacerbated when multiple-choice and 
open-ended items are presented in the same block, with at least one 
multiple-choice item following an open-ended item" (p. 1). This "exacerbation" 
results from the potential for an examinee to attempt the multiple-choice items 
first, then go back to open-ended items (if there is more time). This invalidates NAEP’s 
traditional scoring approach that considers omitted items to be incorrect, and "Not 
Reached" items to be missing. Items treated as missing do not affect the level of 
an examinee’s ability estimate. Such a scoring approach is only valid under the 
assumption that students attempt items in sequential order. Swinton (1994) 
suggested that if one cannot reliably distinguish omitting- from nonreaching- 
behavior, an option is to attempt to model nonresponse in the scoring process. 

The study reported in this paper is an extension of Swinton’s 1994 study. It 
examines the effects on the NAEP achievement levels of using IRT models that 
have nominal missing-response parameters. It compares cutpoints based on item 
parameters that were fitted using two different models. The first set of cutpoints 
were based on parameters for the two- and three-parameter logistic model. The 
second set of cutpoints were based on Bock’s (1972) nominal model. 

Data¹: Data used for this study are from the 1992 NAEP in Mathematics and 
Reading for grade 12. For the purposes of this study, only one block of items from 
each subject was used. For Reading, the data included the responses of 1,966 
grade-12/age-17 students to items in block RD. This block included three 
multiple-choice (M) items, five short open-ended (O) items, and one extended 
response (E) item. For Mathematics, the data set included responses of 2,192 
grade-12/age-17 students to items in block M15. This block included six 
multiple-choice items, three short open-ended items, and one extended response item. 

¹ The authors wish to thank the Center for Assessment of Educational Progress 
and the Educational Testing Service for the data sets that they provided for this 
study. 

Codes for responses to multiple-choice items correspond to the five choices, plus 
three categories corresponding to omitted items, not reached items, and multiple 
response items. Codes for responses to open-ended items correspond to the score 
levels, plus three categories corresponding to omits, not reached, and off-task. The 
order of the different types of items in each block and their nonresponse rates (i.e., 
rates of omits and not reached) are in Table 1. Notice that the nonresponse rate 
is very high in Reading, especially for those items that come later in the block. 

The other data sets that were used are the item-by-item ratings by each 
panelist who participated in the Achievement Levels-Setting (ALS) process in 
NAEP Mathematics and Reading in 1992. There were 10 raters in grade 12 
Reading and 11 raters in grade 12 Mathematics. The data for each rater included 
a modified Angoff rating for each dichotomous item.² At each achievement level, 
Basic, Proficient, or Advanced, the rater provided his/her best estimate of the 
probability that a student performing at the lower borderline of each level would 
respond to the item correctly. Although this was done in three rounds, only the 
third round ratings were used in setting the cutpoints. Thus, in this study, only 
the third round ratings were used. 

² Polytomous items were rated using the paper selection method. However, 
polytomous items will not be considered for this study for reasons that will be 
discussed in the Method section. 



Method: The responses to dichotomous items were recoded so that there were 
only four response codes: "1"=correct, "0"=incorrect, "n"=not reached, and 
"o"=omit. Two different IRT models were fit to each response data set. Thissen’s 
(1990) MULTILOG PC program was used to estimate the parameters and score 
the test. After the parameters were obtained, the probability functions were used 
to map the modified Angoff ratings to compute the cutpoints corresponding to the 
Achievement Levels Descriptions of Basic, Proficient, and Advanced. (Please see 
Appendix A for the Achievement Levels Descriptions.) 

The first model that was fitted was the two-parameter logistic model, 
expanded to three parameters for multiple-choice items. In this model, omits were 
considered wrong and not reached were coded missing. This approximates the 
current NAEP scoring practice. Figure 1 shows a typical item characteristic curve 
(ICC) based on a three parameter logistic model. The second model that was 
fitted was Bock’s (1972) nominal model. Unlike the previous model, Bock’s model 
does not yield a logistic trace line for each item. Instead, it produces curves that 
are ratios of a category-specific exponential to the sum of the exponentials for each 
category. Three categories were used for this model: 1 = no response, 2 = 
incorrect, and 3 = correct. The "no response" category is the combination of omits 
and not reached. A typical set of response characteristic curves for the nominal 
model is shown in Figure 2. Each of these models was fitted to each of the 
response data sets. In each case a normal prior was assumed. 
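
The two functional forms described above can be sketched as follows. This is a minimal illustration, not the MULTILOG implementation: the parameter names (a, b, c for the logistic model; slopes and intercepts for the nominal model), the absence of a scaling constant, and all numeric values are assumptions made for the example.

```python
import math

def icc_3pl(theta, a, b, c=0.0):
    """Three-parameter logistic ICC; c = 0 reduces it to the two-parameter model."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

def nominal_probs(theta, slopes, intercepts):
    """Nominal model: each category probability is its own exponential
    divided by the sum of the exponentials over all categories."""
    z = [a * theta + c for a, c in zip(slopes, intercepts)]
    z_max = max(z)  # subtract the maximum for numerical stability
    expz = [math.exp(v - z_max) for v in z]
    total = sum(expz)
    return [v / total for v in expz]

# Hypothetical parameters, chosen only for illustration.
p_logistic = icc_3pl(0.0, a=1.2, b=0.0, c=0.2)

# Category order follows the text: no response, incorrect, correct.
p_nominal = nominal_probs(0.0, slopes=[-1.0, 0.0, 1.0], intercepts=[0.0, 0.5, 0.5])
```

In the nominal sketch the third entry of the returned list plays the role of the correct-response curve, while the first entry carries the combined omit and not-reached category.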

The graded response (GR) model (Samejima, 1969) was considered for this 
study. In the GR model, an a priori ordering of the categories is required, but 
there is no completely unambiguous way of ordering nonresponse and incorrect 
categories. For this reason, the GR model was not included for comparison. 

Using the parameters estimated for each model, the probabilities of a 
correct response at each point on the θ-scale were summed to produce an expected 
test score. Thus, for the logistic model, the ICCs were added together to form a 
test score function (TSF). For the nominal model, the probability curves for the 
correct responses were summed to form a TSF. The TSFs were used to map the 
modified Angoff ratings to the theta scale to produce the cutpoints. 
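
As an illustration of the summation just described, the sketch below builds a TSF from logistic ICCs. The item parameters are hypothetical and the function names are ours; for the nominal model the same sum would be taken over each item's correct-category curve instead.

```python
import math

def p_correct_3pl(theta, a, b, c=0.0):
    """Probability of a correct response under the three-parameter logistic model."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

def tsf(theta, items):
    """Expected test score: sum of correct-response probabilities over all items."""
    return sum(p_correct_3pl(theta, a, b, c) for a, b, c in items)

# Hypothetical (a, b, c) triples for a three-item block.
hypothetical_items = [(1.0, -0.5, 0.2), (1.5, 0.0, 0.0), (0.8, 1.0, 0.25)]

# Evaluate the TSF on a coarse theta grid from -4 to 4.
grid = [x / 2 for x in range(-8, 9)]
tsf_values = [tsf(t, hypothetical_items) for t in grid]
```

Because every logistic ICC is increasing in theta, a TSF built this way is also increasing, which is what allows a summed rating to be mapped back to a single point on the theta scale.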

To map the modified Angoff ratings to the θ-scale, let 
I = the number of items, and 
J = the number of raters. 

Suppose 

$r_{Xij}$ = rater j’s estimate of the probability that a student performing at the 
borderline of the X achievement level will respond to item i correctly, 
where X = Basic, Proficient, or Advanced, 

i = 1, 2, ..., I, 

j = 1, 2, ..., J. 

These estimates or ratings are summed across items, and the sums are averaged 
across raters. If 

$$r_X = \frac{1}{J} \sum_{j=1}^{J} \sum_{i=1}^{I} r_{Xij},$$

then $r_X$ is mapped to the theta scale using the TSF. The value $\theta_X$, at which 
the TSF equals $r_X$, is the lower borderline, or cutpoint, of the X achievement 
level. For example, using the TSF in Figure 3, if $r_{Proficient}$ = 4.36, then the 
lower borderline of the Proficient achievement level is $\theta_{Proficient}$ = -0.40. 
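
The mapping described above can be sketched numerically. The code below is an illustration under stated assumptions, not the procedure used in the NAEP standard setting: the ratings, the toy TSF, the search interval, and the use of bisection are all ours, and the search presumes that the TSF is increasing over the interval.

```python
import math

def mean_summed_rating(ratings):
    """ratings[j][i] is rater j's rating for item i.
    Sum across items for each rater, then average the sums across raters."""
    sums = [sum(rater_row) for rater_row in ratings]
    return sum(sums) / len(sums)

def cutpoint(target, tsf, lo=-4.0, hi=4.0, tol=1e-8):
    """Find theta with tsf(theta) = target by bisection (tsf assumed increasing)."""
    if not tsf(lo) <= target <= tsf(hi):
        raise ValueError("target score lies outside the TSF range on [lo, hi]")
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if tsf(mid) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

# Toy TSF: two identical logistic items with slope 1 and location 0.
def toy_tsf(theta):
    return 2.0 / (1.0 + math.exp(-theta))

# Hypothetical ratings from two raters on two items.
ratings = [[0.6, 0.4], [0.5, 0.5]]
r_x = mean_summed_rating(ratings)
theta_x = cutpoint(r_x, toy_tsf)
```

With these toy inputs each rater's summed rating is 1.0, and the toy TSF reaches 1.0 at theta = 0, so the computed cutpoint lies at approximately zero.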

Polytomous items were not used for this study because they cannot be 
included in the TSF for the nominal model. The current procedures use the 
partial credit (PC) model for polytomous items. The PC model produces a 
probability curve for each score level of a polytomous item. Suppose a polytomous 
item has four score levels and, given θ, the probability of getting a score of n, 
where n = 1, 2, 3, or 4, is $P(X = n \mid \theta)$ based on the partial credit model. 
Then the sum 

$$\sum_{n=1}^{4} n \, P(X = n \mid \theta)$$

will form the expected score curve for that item. At each value of θ, the expected 
score for polytomous items and the probability of getting correct answers for 
dichotomous items are summed to produce the TSF. When using the nominal 
model with a nonresponse category there is a problem in forming the expected 